Methods for identifying large subsets of differentially expressed genes based on multivariate microarray data analysis

ABSTRACT

The present invention provides multivariate methods for identifying differentially expressed genes based on microarray expression data. An improved random search procedure involving certain probability distances is provided. The methods of this invention implement a successive elimination procedure to remove smaller subsets resulted from each step of the random search thereby establishing a larger set of differentially expressed genes.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates in general to statistical analysis ofmicroarray data generated from nucleotide arrays. Specifically, thepresent invention relates to identification of differentially expressedgenes by multivariate microarray data analysis. More specifically, thepresent invention provides an improved multivariate random search methodfor identifying large sets of genes that are differentially expressedunder a given biological state or at a given biological locale ofinterest according to the values of a probability distance calculatedfor numerous subsets of genes. The method of the invention provides asuccessive elimination procedure to remove smaller subsets resulted fromeach step of the random search thereby establishing a larger set ofdifferentially expressed genes.

2. Description of the Related Art

Gene expression analyses based on microarray data promises to open newavenues for researchers to unravel the functions and interactions ofgenes in various biological pathways and, ultimately, to uncover themechanisms of life in diversified species. A significant objective insuch expression analyses is to identify genes that are differentiallyexpressed in different cells, tissues, organs of interest or atdifferent biological states. So identified, a set of differentiallyexpressed genes associated with a certain biological state, e.g., tumoror certain pathology, may point to the cause of such tumor or pathology,and thereby shed light on the search of potential cures.

In practice, however, gene expression studies are hampered by manydifficulties. For example, poor reproducibility in microarray readingscan obscure actual differences between normal and pathological cells orcreate false positives and false negatives. The tension between theextremely large number of genes present (hence high dimensionality ofthe feature space) and the relatively small number of measurements alsoposes serious challenges to researchers in making accurate diagnosticinferences.

Existing methods for selecting differentially expressed genes aretypically univariate, not taking into account the information oninteractions among genes. As appreciated by an ordinary skilledmolecular biologist, genes do not operate in isolation—activation of onegene may trigger changes in the expression levels of other genes. Thatis, genes may be involved in one or more pathways or networks.Therefore, determination of differentially expressed genes calls forconsideration of covariance structure of the microarray data, inaddition to, for example, mean expression levels. In this regard,however, application of well-established statistical techniques formultidimensional variable selection encounters much difficulty. This isso because, in one aspect, the small number of independent samples andthe presence of outliers make the estimates on selected variablesunstable for large dimensions. In other words, only small sets of genescan be meaningfully considered while a relatively large number of genesare potentially differentially expressed. It is generally impossible tocompare all gene subsets and find the optimal one because the number ofpossible gene combinations is prohibitively large. On the other hand, ifa global optimum could be found, it might be overly specific to atraining sample due to overfitting. Thus, it remains a significantchallenge to scale methods for identifying differentially expressedgenes to deal with microarray data of high dimensional space.

Therefore, there is a need to address the difficulties in applyingmultivariate analysis to microarray data—a need to provide methods foridentifying differentially expressed genes based on gene expression datawith high dimensional feature space.

SUMMARY OF THE INVENTION

It is therefore an object of this invention to provide multivariatemethods for analyzing microarray gene expression data of highdimensional space thereby identifying differentially expressed genes.Particularly, it is an object of this invention to provide methods foridentifying larger sets of differentially expressed genes by successiveeliminating smaller subsets of genes identified from each step of therandom search procedure.

In accordance with the present invention, there is provided methods foridentifying a set of genes from a multiplicity of genes whose expressionlevels at a first and a second state, in a first and a second tissue, orin a first and a second types of cells are measured in replicates usingone or more nucleotide arrays, thereby generating a first plurality ofindependent measurements of the expression levels for the first state,tissue, or type of cells and a second plurality of independentmeasurements of the expression levels for the second state, tissue, ortype of cells. The methods comprise: (a) identifying a quality functioncapable of evaluating the distinctiveness between the first pluralityand the second plurality; (b) forming a first predetermined number ofpermutations from the first and the second pluralities, dividing thepermutations into a first permutated plurality and a second permutatedplurality, corresponding in size, to the first and second plurality,respectively, and identifying groups of genes the size of which is asecond predetermined number, wherein the values of the quality functionfor the group of genes in the first permutated and second permutatedpluralities attain the maximum; (c) determining, from the first andsecond permutated pluralities, the top ad percentile of the nulldistribution based on a quantitative characteristic of the groups ofgenes; (d) identifying, based on the first and second pluralities, asubset of genes the size of which is the second predetermined number,wherein the values of the quality function for the subset of genes inthe first and second pluralities attain the maximum; (e) adding to theset of genes, the subset, if the value of the quantitativecharacteristic associated with the subset exceeds the top α^(th)percentile of the null distribution; and (f) removing from the first andsecond pluralities, all measurements on the subset, if the maximum valueof the quality function associated with the subset exceeds the topα^(th) percentile of the null distribution, and repeating steps (d)-(f)until no more measurements are left in the first and second pluralitiesor the value of the quantitative characteristic associated with thesubset does not exceed the top α^(th) percentile of the nulldistribution.

According to the present invention, in certain embodiments, the statesis may be biological states, physiological states, pathological states,and prognostic states. In other embodiments, the tissues may be normallung tissues, cancer lung tissues, normal heart tissues, pathologicalheart tissues, normal and abnormal colon tissues, normal and abnormalrenal tissues, normal and abnormal prostate tissues, and normal andabnormal breast tissues. In yet other embodiments, the types of cellsmay be normal lung cells, cancer lung cells, normal heart cells,pathological heart cells, normal and abnormal colon cells, normal andabnormal renal cells, normal and abnormal prostate cells, and normal andabnormal breast cells. In still other embodiments, the types of cellsmay be cultured cells and cells isolated from an organism.

In one embodiment of this invention, the quality function is representedby a probability distance between random vectors. In another embodiment,the probability distance function is selected from the group consistingof the Mahalanobis distance and the Bhattacharya distance. In yetanother embodiment, the probability distance function is defined as:N(μ,ν)=2∫_(R) _(d) ∫_(R) _(d) L(x,y)dμ(x)dν(y)−∫_(R) _(d) ∫_(R) _(d)L(x,y)dμ(x)dμ(y)−∫_(R) _(d) ∫_(R) _(d) L(x,y)dν(x)dν(y)where μ and ν are two probability measures defined on the Euclideanspace, and L(x, y) is a strictly negative definite kernel. In stillanother embodiment, the negative definite kernel is combined with theEuclidean distance between x and y to form a composite kernel function.

According to one embodiment, the quantitative characteristic is selectedfrom the group consisting of an associated probability distance, a testset classification rate, and a cross-validation classification rate.

According to another embodiment, the formation of the permutationsfurther comprises: (i) shifting the measurements in the first and secondpluralities such that the marginal means thereof share the same truemean; and (ii) randomly permuting the resulting shifted measurementsthereby forming a null-distribution of permutations.

According to yet another embodiment, the identifying further comprises:(i) calculating the values of the quality function for the subset ofgenes in the first and second pluralities thereby evaluating thedistinctiveness of the first and second pluralities; and (ii)substituting a gene in the subset with one outside of the subset,thereby generating a new subset, and repeating step (i), keeping the newsubset if the distinctiveness increases and the original subset ifotherwise; and (iii) repeating steps (i) and (ii) for a fourthpredetermined number of times.

According to still another embodiment, the identifying furthercomprises: (i) randomly dividing the first and the second pluralitiesinto v groups of an approximate equal size; (ii) removing one of the vgroups from the first and second pluralities and identifying, from theresulting reduced first and second pluralities, a subset of genes forwhich the value of the quality function attains the maximum; and (iii)repeating step (ii) for each of the v groups thereby obtaining v subsetsof genes.

In various embodiments of the invention, the nucleotide arrays may bearrays having spotted thereon cDNA sequences and/or arrays havingsynthesized thereon oligonucleotides.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows the properties of the optimal subsets of genes identifiedin a computer simulation study using a random search method with asuccessive elimination procedure according to one embodiment of theinvention.

FIG. 2 shows the properties of the optimal subsets of genes identifiedin an expression analysis of colon cancer cells using a random searchprocedure with a successive elimination procedure according to oneembodiment of the invention.

FIG. 3 shows the estimates of the null-distributions based on theassociated probability distance (the top panel), the test setclassification rate (the bottom panel, the curve on the left), and thecross validation classification rate (the bottom panel, the curve on theright) for the 5-element optimal subset of genes in a “no-difference”dataset generated by a resampling procedure according to one embodimentof the invention.

DETAIL DESCRIPTIONS OF DISCLOSURE

Definition

As used herein, the term “microarray” refers to nucleotide arrays;“array,” “slide,” and “chip” are used interchangeably in thisdisclosure. Various kinds of nucleotide arrays are made in research andmanufacturing facilities worldwide, some of which are availablecommercially. There are, for example, two kinds of arrays depending onthe ways in which the nucleic acid materials are spotted onto the arraysubstrate: oligonucleotide arrays and cDNA arrays. One of the mostwidely used oligonucleotide arrays is GeneChip™ made by Affymetrix, Inc.The oligonucleotide probes that are 20- or 25-base ong are synthesizedin silico on the array substrate. These arrays tend to achieve highdensities (e.g., more than 40,000 genes per cm²). The cDNA arrays, onthe other hand, tend to have lower densities, but the cDNA probes aretypically much longer than 20- or 25-mers. A representative of cDNAarrays is LifeArray made by Incyte Genomics. Pre-synthesized andamplified cDNA sequences are attached to the substrate of these kinds ofarrays.

Microarray data, as used herein, encompasses any data generated usingvarious nucleotide arrays, including but not limited to those describedabove. Typically, microarray data includes collections of geneexpression levels measured using nucleotide arrays on biological samplesof different biological states and origins. The methods of the presentinvention may be employed to analyze any microarray data; irrespectiveof the particular microarray platform from which the data are generated.

Gene expression, as used herein, refers to the transcription of DNAsequences, which encode certain proteins or regulatory functions, intoRNA molecules. The expression level of a given gene refers to the amountof RNA transcribed therefrom measured on a relevant or absolutequantitative scale. The measurement can be, for example, an opticdensity value of a fluorescent or radioactive signal, on a blot or amicroarray image. Differential expression, as used herein, means thatthe expression levels of certain genes are different in differentstates, tissues, or type of cells, according to a predeterminedstandard. Such standard maybe determined based on the context of theexpression experiments, the biological properties of the genes understudy, and/or certain statistical significance criteria.

The terms “vector,” “probability distance,” “distance,” “the Mahalanobisdistance,” “the Euclidean distance,” “feature,” “feature space,”“dimension,” “space,” “type I error,” “type II error,” “ROC curve,”“permutation,” “random permutation,” and “null distribution” are to beunderstood consistently with their typical meanings established in therelevant art, i.e. the art of mathematics, statistics, and any arearelated thereto. For example, a set of microarray data on p distinctgenes represents a random vector X=X₁, . . . , X_(p) with mutuallydependent components.

Random Search to Identify Subsets of Genes of a Predetermined Size

Suppose two tissues, types of cells, or biological states are ofinterest, one of which corresponds to the normal physiology while theother implicates certain pathology such as tumor. The distinctiveness ofthese two tissues, types of cells, or states can be evaluated bymicroarray experiments in which the expression levels of all the genes(up to thousands measured on a single chip or slide as made possible bythe recent advances in the microarray manufacturing) are determined. Acollection of differentially expressed genes would therefore account, atthe genomic/genetic level, for the distinctiveness of the two tissues,type of cells, or states. Certain multivariate distances are employed toevaluate such distinctiveness according to this invention.For example, a probability distance and its nonparametric estimate maybe used in this context. Let μ and ν be two probability measures definedon the Euclidean space. Let L(x,y) be a strictly negative definitekernel, that is${{\sum\limits_{i,{j = 1}}^{s}{{L\left( {x_{i},x_{j}} \right)}h_{i}h_{j}}} \leq {0\quad{for}\quad{any}\quad x_{1}}},\ldots\quad,{x_{s}\quad{and}\quad h_{1}},\ldots\quad,h_{s},{{\sum\limits_{i = 1}^{s}h_{i}} = 0}$with equality if and only if all h_(i)=0. It can be shown that aprobability distance N(μ, ν) as defined below is a metric in the spaceof all probability measures on R^(d).N(μ,ν)=2∫_(R) _(d) ∫_(R) _(d) L(x,y)dμ(x)dν(y)−∫_(R) _(d) ∫_(R) _(d)L(x,y)dμ(x)dμ(y)−∫_(R) _(d) ∫_(R) _(d) L(x,y)dν(x)dν(y)Consider two independent samples, consisting of n₁ and n₂ observationsrespectively, represented by the d-dimensional vectors x₁, . . . ,x_(n1)and y₁, . . . , y_(n2). An empirical counterpart of N(μ, ν) may berepresented as follows $\begin{matrix}{\hat{N} = {N\left( {\mu_{n1},v_{n2}} \right)}} \\{= {{\frac{1}{n_{1}n_{2}}{\sum\limits_{i = 1}^{n1}{\sum\limits_{j = 1}^{n2}{2L\left( {x_{i},y_{j}} \right)}}}} -}} \\{{\frac{1}{n_{1}^{2}}{\sum\limits_{i = 1}^{n1}{\sum\limits_{j = 1}^{n1}{L\left( {x_{i},x_{j}} \right)}}}} - {\frac{1}{n_{2}^{2}}{\sum\limits_{i = 1}^{n2}{\sum\limits_{j = 1}^{n2}{{L\left( {y_{i},y_{j}} \right)}.}}}}}\end{matrix}$

A pertinent kernel function L needs to be chosen when the probabilitydistance N(μ, ν) is used. Appropriate choices include the Euclideandistance between ranks and a monotone function of the Euclidean distancesatisfying the condition of negative definiteness. Additionally, analternative class of kernel functions may be used to measure pairwisegene interaction.

Let x and y denote observations in two samples on a gene set and x^(r)and y^(r) denote the corresponding rank-adjusted observations. Considereither of these observations to be points in Euclidean space. Let S be ameasurable subset of R^(d). Define L_(S) by the rule L_(S)(x,y)=0 ifboth xεS and yεS and L_(S)(x,y)=1 otherwise. L_(S) is a negativedefinite kernel. Suppose, X_(i)εS, 1≦i≦r, and x_(i)∉S, r+1≦i≦s, then onewould have${\sum\limits_{i,{j = 1}}^{s}{\left( {1 - {L_{S}\left( {x_{i},y_{j}} \right)}} \right)h_{i}h_{j}}} = {{\sum\limits_{i,{j = 1}}^{r}{h_{i}h_{j}}} = {\left( {\sum\limits_{i,{j = 1}}^{r}h_{i}} \right)^{2} \geq 0.}}$Thus (1−L_(S)) is a positive definite kernel.

More generally, let ƒ(x) be a function from a space to the interval[0,1], and define L_(ƒ)(x,y)=max(ƒ(x),ƒ(y)), then L_(ƒ) is a negativedefinite kernel. Also, if one defines g_(α)(x,y)=0 provides both ƒ(x)>αand ƒ(y)>α and g_(α)(x,y)=1 otherwise, then, from the previousparagraph, g_(α) is a negative definite kernel. It follows from theequality L_(ƒ)(x,y)=∫₀ ¹g_(α)(x,y)dα that L_(ƒ) is negative definite.Since a negative definite kernel is unaffected by an arbitrary additiveshift, it is clear that L_(ƒ)(x,y)=max(ƒ(x),ƒ(y)) will be a negativedefinite kernel for any bounded function ƒ.

If w_(i) are positive weights and ƒ_(i), 1≦i≦d, are functions from to[0,1], then $L = {\sum\limits_{i = 1}^{d}{w_{i}L_{f_{i}}}}$is also a negative definite kernel. From the foregoing derivations, onewould also have: if {ƒ_(i)} separates points, that is, ƒ_(i)(x)=ƒ_(i)(y)for all i implies x=y, then L is strictly negative definite.

Negative definite kernels of the type described above may be combinedwith the usual Euclidean distance to form composite kernel functions.For example, define a region function R_(q)(ν,ν)=q└qν┘+└qν┘ (here └·┘denotes the floor function, its value is the largest integer notexceeding the argument and q≧2 is an integer parameter). This functionis constant on each of the q² obtained by dividing the sides of the(0,1)² into q equal segments. Then the following kernels on the rankeddata may be defined:${{L_{1}\left( {x^{r},y^{r}} \right)} = \sqrt{\sum\limits_{g \in S}\left( {x_{g}^{r} - y_{g}^{r}} \right)^{2}}},\begin{matrix}{{L_{2}\left( {x^{r},y^{r}} \right)} = {{w_{1}{L_{1}\left( {x^{r},y^{r}} \right)}} +}} \\{w_{2}{\sum\limits_{{({g_{1},g_{2}})} \in S^{2}}\left( {1 - {I\left\{ {R_{q}\left( {x_{g_{1}}^{r},x_{g_{2}}^{r}} \right)} \right.}} \right.}} \\{\left. \left. {= {R_{q}\left( {y_{g_{1}}^{r},y_{g_{2}}^{r}} \right)}} \right\} \right),}\end{matrix}$where I is the indicator function. Then L₁ is the standard Euclideandistance and L₂ falls into the class described above. We choose theweights w₁ and w₂ to balance the two components of L₂ with respect totheir maximum values:${w_{1} = {{{1/d_{\max}}\quad{and}\quad w_{2}} = {1/\begin{pmatrix}d_{\max} \\2\end{pmatrix}}}},$where d_(max) is the maximum subset dimension under consideration. Thesecond component of the kernel will be insensitive to perturbation, yetpick up sets of genes that have similar expression levels across samplesin one tissue and different expression patterns in the two tissues.

In another alternative embodiment, a function L_(ƒ) is based on thecorrelation coefficient. Let x^(n) and y^(n) denote normalized data suchthat the tissue-specific sample mean and variance are zero and onerespectively. For each pair of genes g₁ and g₂, consider the functionƒ_(g) ₁ _(,g) ₂ (x^(n))=x_(g1) ^(n)x_(g2) ^(n). The correspondingnegative definite kernel L_(g1,g2) will detect differences incorrelation between the two tissues. For example, if the expressions ofg₁ and g₂ have correlation coefficient ρ in one tissue and areuncorrelated in the other, it follows from 2max(ρ,0)−max(ρ,ρ)−max(0,0)=|p| that the corresponding distance betweenthe tissues will be approximately equal to |ρ|.

A negative definite kernel may, in this embodiment, be defined as:${L_{3}\left( {x,y} \right)} = {{w_{1}{L_{1}\left( {x,y} \right)}} + {w_{2}{\sum\limits_{({g_{1},g_{2}})}{L_{g_{1},g_{2}}\left( {x,y} \right)}}}}$

The weights w₁ and w₂ may be chosen to balance the contribution of thetwo components. A distance based on L₃ will tend to pick up sets ofgenes with separated means and differences in correlation in the twosamples.

In various embodiments of this invention, once an aforementionedmultivariate distance is selected, it may be used to search for asubset(s) of genes that are differentially expressed between the twotissues, types of cells, or biological states as the correspondingvalues of the distance are maximized. The size of such subsets ispredetermined, which are typically small since they are limited by theavailable sample replicates. In theory, all subsets of a predeterminedsize need to be evaluated in terms of the adopted distance and the onethat provides a maximum distance should be chosen as the final set ofdifferentially expressed genes. In practice, however, the number ofpossible subsets exponentially increases with the total number of ge nesinvolved and, consequently, the exhaustive search procedures as well asthe branch-and-bound method (see, e.g., Fukunaga K., (1990),Introduction to Statistical Pattern Recognition, Academic Press, London,2nd.) become computationally prohibitive. Therefore, various stepwiserandom search procedures may be advantageously adopted according to thisinvention in identifying subsets of differentially expressed genes of apredetermined size.

In this connection, the search for a subset of genes with the bestdiscrimination between two tissues, type of cells, or states often turnsup overly-optimistic conclusions due to overfitting, i.e., findingoverly specific patterns that do not extend to new samples. To mitigatesuch local selection bias, cross validation techniques may be adopted inrandom searches according to this invention, an example procedure(Procedure 1) is provided as follows:

1. Randomly divide the data into v groups of an approximate equal size;

2. Drop one of the n groups and find the optimal subset of genes usingonly the data from v−1 group, based on the evaluation of the applicableprobability distance.

3. Repeat step 2 in succession for each of the groups, obtaining voptimal subsets.

4. Combine these sets by selecting the genes with the highestfrequencies of occurrences.

In alternative embodiments, multiple local searches may be performed andthen the resulting locally sub-optimal subsets may be integrated suchthat a final set of differentially expressed genes may be identified(e.g., by including the genes with the highest frequency of occurrencesin the locally sub-optimal subsets).

Establishing Larger Sets of Genes Based on the Identified SmallerSubsets

As discussed above, random search procedures based on certainprobability distances may be utilized to identify a subset ofdifferentially expressed genes of a predetermined size. And, since apredetermined size as such often is limited by the scarcity of thesample size (especially when the total number of genes is large and thedimensionality of the microarray data is high), it is desirable to finda way to enlarge the size of the set of differentially expressed genesidentified.

In one embodiment of this invention, a successive selection procedure isadopted to eliminate groups of genes after each run of the random searchprocedure, until no more subsets of genes can be found that satisfy thesearch criteria. The final set of differentially expressed genes wouldthen include all the removed genes at each step. Essential to thismethod is the formulation of a stopping rule at each step.

The formulation of such an appropriate stopping rule turns on theevaluation of the properties of an optimal set of genes in a“no-difference” data set. Various quality functions may be used in thiscontext to provide a model to evaluate such properties. For example,certain multivariate distances are used as the quality function invarious embodiment of this invention. The selection process based on theapplication of such multivariate quality functions would necessarily beinfluenced by the covariance structure of the microarray data. Thus, the“no-difference” baseline data (i.e., corresponding to thenull-distribution) ought to be generated in such a way that thecovariance data structure is preserved. The following two-step“resampling” process (Procedure 2) meets such requirement. The firststep ensures that the marginal means of the two data sets (may have beenobtained from two tissues, types of cells, or biological states) havethe same true mean. And, the second step mimics the biologicalvariability through permutation.

Denote the adjusted fluorescence level for gene i, i=1 . . . , p in thetwo tissues by X_(ij), j=1 . . . , n₁ and Y_(ij), j=1 . . . , n₂,respectively.

1. For each gene i, i=1, . . . ,p shift the values from the two datasets so they are centered at the overall mean for this gene, that is${X_{ij}^{*} = {X_{ij} - {{\overset{\_}{X}}_{i}.{+ \frac{n_{1}{{\overset{\_}{X}}_{i}.{+ n_{2}}}{{\overset{\_}{Y}}_{i}.}}{n_{1} + n_{2}}}}}},{Y_{ij}^{*} = {Y_{ij} - {{\overset{\_}{Y}}_{i}.{+ \frac{n_{1}{{\overset{\_}{X}}_{i}.{+ n_{2}}}{{\overset{\_}{Y}}_{i}.}}{n_{1} + n_{2}}}}}}$

2. Randomly permute the resulting n₁+n₂ vectors. The first n₁ and thelast n₂ vectors provide a random sample from the null-distribution.

Based on this permutation resampling scheme, the null-distributions ofvarious quantitative characteristics of the optimal gene set may beestimated. For example, the associated probability distance, crossvalidation classification rate (using a selected subset upon crossvalidation), and test set classification rate (using an independent testset) may be considered. A test set classification rate is calculated byclassifying each sample from an independent test set using the selectedsubset of genes and the entire training set and determining the rate ofthe correct classification. A cross-validation classification rate iscalculated by classifying each sample in the training set (in theabsence of a test set) using the selected subset of genes and the restof the training set and determining the rate of the correctclassification. Generally, the test set classification rate may be mostdesirable but, due to the scarcity of samples, an appropriate test setis often unavailable. In such situations, the between-tissue distanceassociated with gene sets may be a good and stable proxy for theclassification rate.

According to a particular embodiment, a probability distance-basedsuccessive-selection procedure is adopted in selecting a subset of genesthat are differentially expressed in two tissues, type of cells, orbiological states, as outlined below (Procedure 3). The successiveselection based on cross-validation or test set classification rates maybe similarly adopted in connection with random searches in alternativeembodiments of this invention.

The following procedure (Procedure 3) starts with the selection of asubset of genes with a size k and requires a significance level a fordefining a percentile of the null-distribution of the data sets.

1. Form m independent permutation samples of sizes n1 and n2,respectively, from n1+n2 observations (arrays/slides). For each of the mpermutation samples, find an optimal k-element subset of genes for whichthe associated probability distance attains its maximum. Estimate fromthe permutation samples the top α^(th) percentile D_(α) of the baselinedistribution of the optimal distance (referred to as thenull-distribution).

2. Returning to the original two data set setting, find the k-elementoptimal set of genes for which the associated probability distanceattains its maximum and denote it by G₁. If the associated probabilitydistance D(G₁)>D_(α), then continue, otherwise stop the search.

3. In the t^(th) iteration, discard sets G₁, . . . , G_(t-1) and findthe k-element optimal set G_(t) from the remaining genes. If theassociated distance D(G_(t))>D_(α), then continue with this step (nextiteration), otherwise proceed to step 4.

4. The final set of differentially expressed genes are defined by theunion ∪_(j=1) ^(t)G_(j).

Computer Simulation of the Improved Random Search

A simulation study was performed to evaluate the improved random searchmethod with the successive elimination procedure. A total of 1000 geneswas divided into subsets of equal size 20. In the first data set, nodifferential expression was imposed, and hence any difference shownwould be due to the within-tissue “biological variability.” In thesecond data set, one of the subsets (including 20 mutually dependentexpression signals) was set to be differentially expressed with a ratioof two. The correlation structure was kept the same in the two datasets. Further, an independent test set of 100 observations (with equalproportions of the two hypothetical tissues) was simulated for the twodata sets in order to estimate the true classification rate of theselected gene sets.

A cross-validated random search was performed in accordance withProcedure 1 supra. Particularly, step 2 of Procedure 1 was carried outin the following details (Procedure 4):

1. Randomly select k genes to form the initial approximation; calculatethe associated probability distance between the two data sets for thissubset of genes.

2. Replace at random one gene from the current subset with a geneoutside of the subset and calculate the value of the associatedprobability distance for the resulting new subset; if the distance islarger than that of the previous subset, keep the new subset and,otherwise, revert to the previous subset.

3. Repeat the process until a predetermined number M of iterations isreached.

In this particular simulation, M was set at 100,000. The successivesearch for 5-member optimal gene sets (k=5) was performed using the10-fold (v=10) cross-validated search procedure (Procedure 1). TheEuclidean distance was chosen for the kernel L(x, y) in the distancemeasure. For each of the successive optimal sets G_(i), thecorresponding optimal distance was recorded and the tissueclassification rate was estimated using both cross validation (using theselected gene set) and the independent test set. The results are shownin FIG. 1.

Referring to FIG. 1, the top panel shows the results for the data setthat had no difference imposed whereas the bottom panel shows theresults for the is data set that had a subset of 20 genes to bedifferentially expressed in the two hypothetical tissues. In bothpanels, the left y axis represents the associated probability distancewhile the right y axis denotes the classification rate based on theindependent test set (hence test set classification rate—“Class”) andthe classification rate based on cross validation using selected geneset (hence cross validation classification rate—“CV”). The x axis ofboth panel denotes the number of subsets of genes with a predeterminedsize of 5. As shown in both panels of FIG. 1, the estimate of the testset classification rate and that of the cross validation classificationrate are both highly variable for both data sets, whereas the associateddistance (Dist) is decreasing monotonically. Since the optimal sets wereselected based on the associated probability distance in thissimulation, the observed monotonicity confirms the ability of the randomsearch procedure of this invention to find an optimal subset.

To reduce the observed variability of the classification rate estimates,isotonic regression (see, Robertson T. et al., (1988) Order RestrictedStatistical Inference, Wiley, London) was performed to smooth thecorresponding curves and thereby generate the corresponding solid linesin FIG. 1, assuming the true rates to be non-increasing. The dottedhorizontal lines represent the level of the 99th percentile of thenull-distributions of the corresponding measures (i.e., the associatedprobability distance, the test set classification rate, and the crossvalidation classification rate); they were estimated by generating 100random permutation samples that mimic “no-difference” data in accordancewith Procedure 2 supra. For the first data set, referring to the toppanel of FIG. 1, all the observed curves lie entirely below their cutoffvalues, which demonstrates that the random search of the invention withthe successive elimination procedure correctly identifies nodifferentially expressed genes in the first data set.

For the second data set, since 20 genes were set to be differentiallyexpressed, the selection should stop after 4 iterations (i.e., toidentify 4 subsets of 5 genes). Referring to the bottom panel of FIG. 1,the distance curve (Dist) passes its cutoff level after the thirditeration, whereas the cross validation and the test set classificationcurves pass their cutoff levels after the fourth iteration. Thus, in thesimulated random search, the successive elimination procedures based onassociated distance, the test set classification rate, as well as thecross validation classification rate all performed satisfactorily inthis simulation, with the distance-based procedure slightly inferior tothe other two as it stopped early. However, the distance-based proceduredemonstrated superior stability and therefore it remains a powerfulalternative in certain embodiments of this invention. In summary, thedistance based cutoff identified 14/20=70% of the 20 differentiallyexpressed genes with a false positive (type I error) rate of only1/15=6.7%, while the two classification based cutoffs identified16/20=80% of the differentially expressed genes with a 4/20=20% falsepositive rate. The differentially expressed genes were marked with starsin the bottom panel of FIG. 1.

The invention is further described by the following examples, which areillustrative of the invention but do not limit the invention in anymanner.

EXAMPLE 1 A Source Code Segment Implementing Successive Selection andRe-Sampling

unit FExclude; interface uses  Windows, Messages, SysUtils, Classes,Graphics, Controls, Forms,  Dialogs, Spin, StdCtrls, Grids, Aligrid,EnhCBox, NumIO,  ComCtrls, Matrix, Vector; type  TExcludeForm =class(TForm)   NExcludeSteps: TSpinEdit;   Label1: TLabel;  RunEliminationButton: TButton;   ExcludeResult: TStringAlignGrid;  SaveButton: TButton;   SaveDialog: TSaveDialog;   Label2: TLabel;  Class1Box: TEnhComboBox;   Class2Box: TEnhComboBox;   H0Button:TButton;   H0repInput: TNumIO;   Label3: TLabel;   ExcludePB:TProgressBar;   RandomClusterButton: TButton;   RunFromDiffClButton:TButton;   procedure RunEliminationButtonClick(Sender: TObject);  procedure FormCreate(Sender: TObject);   procedureSaveButtonClick(Sender: TObject);   procedureExcludeResultKeyDown(Sender: TObject; var Key: Word;    Shift:TShiftState);   procedure ExcludeResultFixedRowClick(Sender: TObject;row:   Integer);   procedure ClassBoxDblClick(Sender: TObject);  procedure H0ButtonClick(Sender: TObject);   procedureRandomClusterButtonClick(Sender: TObject);   procedureRunFromDiffClButtonClick(Sender: TObject);  private   { Privatedeclarations }   procedure OneEliminationStep;   functionClassifyTestSets(filename: string; normal, centerdata:   boolean)     :string;  public   { Public declarations }   PermMat1, PermMat2: TMatrix;  RandDistCurves, ClassCurves, RandCV: TMatrix;   MeanDiff: TVector; end; var  ExcludeForm: TExcludeForm; implementation uses FDiffClust,DiffCluster, ClassificationF, readdata, RandomGen; {$R *.DFM} procedureTExcludeForm.RunEliminationButtonClick(Sender: TObject); var i, nsteps:integer;  testset: boolean; begin  nsteps:= NExcludeSteps.Value; ExcludeResult.RowCount:= nsteps+1;  testset:=(FileExists(Class1Box.Text)) and      (FileExists(Class2Box.Text)); ExcludePB.Position:= 0;  ExcludePB.max:= nsteps + 1;  for i:= 1 tonsteps do begin   try    ExcludePB.StepIt;    ExcludePB.Update;   BatchProcess:= True;   DiffClusterForm.FindDiffClusterButtonClick(self);   DiffClusterForm.DistButtonClick(self);    if notAssigned(ClassificationForm) then     ClassificationForm:=TClassificationForm.Create(self);    ifDiffClusterForm.AdjustType.ItemIndex <=2 then    ClassificationForm.AdjustOptions.ItemIndex:=     DiffClusterForm.AdjustType.ItemIndex;   ClassificationForm.CrossvalidButtonClick(self);   ExcludeResult.Row:= i;    ExcludeResult.CellAsInt[0,i]:= i;   ExcludeResult.Cells[1,i]:= DiffClusterForm.OutputMemo.Text;   ExcludeResult.Cells[2,i]:= ClassificationForm.   PCorrectOutput.Text+‘%’;    ExcludeResult.Cells[3,i]:=DiffClusterForm.DistOutput.Caption;    ExcludeResult.Cells[4,i]:=    FloatToStrF(MinFreqInDiffcl*100,ffFixed,3,1)+‘%’;   ExcludeResult.Cells[5,i]:=    FloatToStrF(MaxFreqInDiffcl*100,ffFixed,3,1)+‘%’;    if testset thenbegin     ExcludeResultCells[6,i]:=     ClassifyTestSets(Class1Box.Text,True,False)+‘%’;    ExcludeResult.Cells[7,i]:=     ClassifyTestSets(Class2Box.Text,False,False)+‘%’;    end;   OneEliminationStep;   finally    BatchProcess:= False;    end;   end;  ExcludePB.Position:= 0;  end;  procedureTExcludeForm.OneEliminationStep;  var i, gene: integer;   genelist:=TStringList;  begin   genelist: TStringList.Create;   try   genelist.CommaText:= DiffClusterForm.OutputMemo.Text;    for i:= 0 togenelist.Count−1 do begin    gene:= StrToInt(genelist[i]);   UseGeneInd[gene−1]:= 0;   end;  finally   genelist.Free;  end; end;function TExcludeForm.ClassifyTestSets; begin  with ClassificationFormdo begin   ClassifFileName.Text:= filename;   if centerdata then   RunButtonClick(H0Button)   else   RunButtonClick(RunEliminationButton);   if normal then   ActualClassOptions.ItemIndex:= 0   else   ActualClassOptions.ItemIndex:= 1;   Result:= PCorrectOutput.Text; end; end; procedure TExcludeForm.FormCreate(Sender: TObject); begin  ExcludeResult.AllowCutnPaste:= True;  ExcludeResult.PasteEditableOnly:= False; end; procedureTExcludeForm.SaveButtonClick(Sender: TObject); begin  ifSaveDialog.Execute then   ExcludeResult.SaveToFile(SaveDialog.FileName);end; procedure TExcludeForm.ExcludeResultKeyDown(Sender: TObject; varKey: Word;  Shift: TShiftState); begin   if (Key=67) then    if(Shift=[ssCtrl]) then    with ExcludeResult do   Contents2CSVClipboard(#9,Selection); end; procedureTExcludeForm.ExcludeResultFixedRowClick(Sender: TObject;  row: Integer);begin   if ExcludeResult.Cells[1,row]<>” then    with DiffClusterForm dobegin    OutputMemo.Clear;    OutputMemo.Text:=ExcludeResult.Cells[1,row];    end; end; procedureTExcludeForm.ClassBoxDblClick(Sender: TObject); begin   ifSaveDialog.Execute then begin    (Sender as TEnhComboBox).Text=SaveDialog.FileName;   end; end; procedureTExcludeForm.H0ButtonClick(Sender: TObject); var A, B: TMatrix;  ss1,ss2, s, i, j, k, nH0rep, nsteps: integer;  sampleperm: array of double; stepmin, stepmax: double;  testset: boolean; begin  caseDiffClusterForm.AdjustType.ItemIndex of   0: begin A:= normal; B:= polypend;   1: begin A:= renormal; B:= repolyp end;   2: begin A:= rnormal;B:= rpolyp end;   else Exit;  end;  testset:=(FileExists(Class1Box.Text)) and    (FileExists(Class2Box.Text));  ss1:=A.NrOfRows;  ss2:= B.NrOfRows;  SetLength(sampleperm, ss1+ss2);  for i:=1 to ss1 do   sampleperm[i−1]:= i;  for i:= 1 to ss2 do  sampleperm[ss1+i−1]:= ss1+i;  PermMat1:= TMatrix.Create(A.NrOfColumns,ss1);  PermMat2:= TMatrix.Create(B.NrOfColumns, ss2);  nH0rep:=Trunc(H0repInput.Value);  nsteps:= NExcludeSteps.Value; RandDistCurves:= TMatrix.Create(nH0rep, nsteps);  RandCV:=TMatrix.Create(nH0rep, nsteps);  if not Assigned(ClassificationForm)then   ClassificationForm:= TClassificationForm.Create(self);  iftestset then begin   ClassCurves:= TMatrix.Create(nH0rep, nsteps);  end; //calculate vector of gene-mean differences  MeanDiff:=TVector.Create(lines);  for i:= 1 to lines do   MeanDiff[i]:=B.Sum(i,1,i,ss2)/ss2−A.Sum(i,1,i,ss1)/ss1;  BatchProcess:= True; ExcludePB.Position:= 0;  ExcludePB.Max:= nH0rep+1;  try   for i:= 1 tonH0rep do begin   ExcludePB.StepIt;   //include all genes   for j:= 0 tohigh(UseGeneInd) do   UseGeneInd[j]:= 1;   //setup randomly permutedsamples   RandomPerm(sampleperm);   // bootstrap samples   { for j:= 0to ss1+ss2−1 do   sampleperm[j]:= Ran0(ss1+ss2−1)+1; }   for j:= 0 toss1−1 do   if sampleperm[j]>ss1 then begin    s:=Trunc(sampleperm[j])−ss1;    for k:= 1 to lines do    PermMat1[k,j+1]:=B[k,s]−ss1/(ss1+ss2)*MeanDiff[k];   end   else begin    s:=Trunc(sampleperm[j]);    for k:= 1 to lines do    PermMat1[k,j+1]:=A[k,s]+ss2/(ss1+ss2)*MeanDiff[k];   end; for j:= ss1 to ss1+ss2−1 do  ifsampleperm[j]>ss1 then begin   s:= Trunc(sampleperm[j])−ss1;   for k:= 1to lines do    PermMat2[k,j−ss1+1]:= B[k,s]−ss1/(ss1+ss2)*MeanDiff[k]; end  else begin   s:= Trunc(sampleperm[j]);   for k:= 1 to lines do   PermMat2[k,j−ss1+1]:= A[k,s]+ss2/(ss1+ss2)*MeanDiff[k];  end; if i=1then begin  PermMat1.StoreOnFile(1,1,100,ss1,‘perm1.txt’); PermMat2.StoreOnFile(1,1,100,ss2,‘perm2.txt’); end; //calculatedistance curve for this permutation  for j:= 1 to nsteps do begin DiffClusterForm.FindDiffClusterButtonClick(H0Button); DiffClusterForm.DistButtonClick(H0Button);  RandDistCurves[i,j]:=StrToFloat(DiffClusterForm.  DistOutput.Caption);  ifDiffClusterForm.AdjustType.ItemIndex <=2 then  ClassificationForm.AdjustOptions.ItemIndex:=   DiffClusterForm.AdjustType.ItemIndex; ClassificationForm.CrossvalidButtonClick(H0Button);  RandCV[i,j]:=StrToFloat(ClassificationForm.PCorrectOutput.Text);  if testset thenbegin   ClassCurves[i,j]:=   (StrToFloat(ClassifyTestSets(Class1Box.Text,True,True)) +    StrToFloat(ClassifyTestSets(Class2Box.Text,False,True)))/2;   ifClassCurves[i,j]=100 then    ShowMessage(‘Perfect again:(’);  end; OneEliminationStep;  end; end; { //calculate percentiles (min-max) RandDistCurves.Resize(nH0rep+2,nsteps);  for j:= 1 to nsteps do begin RandDistCurves.MinMax(1,j,nH0rep,j,stepmin,stepmax); RandDistCurves[nH0rep+1,j]:= stepmin;  RandDistCurves[nH0rep+2,j]:=stepmax;  end; } finally  PermMat1.Free;  PermMat2.Free; RandDistCurves.StoreOnFile(1, 1, nH0rep, nsteps, ‘randdistcurves.txt’); RandDistCurves.Free;  RandCV.StoreOnFile(1, 1, nH0rep, nsteps,‘randCV.txt’);  RandCV.Free;  MeanDiff.Free;  if testset then begin ClassCurves.StoreOnFile(1, 1, nH0rep, nsteps, ‘classcurves.txt’); ClassCurves.Free;  end;  ExcludePB.Position:= 0;  //include all genes for i:= 0 to high(UseGeneInd) do   UseGeneInd[i]:= 1;  BatchProcess:=False;  end; end; procedureTExcludeForm.RandomClusterButtonClick(Sender: TObject); var nH0rep, i,j, size: integer;  RandDist, RandClass: TVector;  testset: boolean; indexarr: array of double,  F: TextFile; begin  nH0rep:=Trunc(H0repInput.Value);  size:=Trunc(DiffClusterForm.NumElemInput.Value);  initprogress(ExcludePB,nH0rep);  testset:= (FileExists(Class1Box.Text)) and    (FileExists(Class2Box.Text));  RandDist:= TVector.Create(nH0rep); if testset then begin   RandClass:= TVector.Create(nH0rep);   if notAssigned(ClassificationForm) then    ClassificationForm:=TClassificationForm.Create(self);  end;  SetLength(indexarr, lines); for i:= 0 to lines−1 do   indexarr[i]:= i+1;  BatchProcess:= True;  try  for i:= 1 to nH0rep do begin    RandomPerm(indexarr);    withDiffClusterForm do begin    OutputMemo.Clear;    OutputMemo.Text:=IntToStr(Trunc(indexarr[0]));    for j:= 2 to size do    OutputMemo.Text:= OutputMemo.Text+ ‘, ’ +      IntToStr(Trunc(indexarr[j]));   DistButtonClick(RandomClusterButton);    RandDist[i]:=StrToFloat(DistOutput.Caption);    if testset then begin    RandClass[i]:=     (StrToFloat(ClassifyTestSets(Class1Box.Text,True,True)) +     StrToFloat(ClassifyTestSets(Class2Box.Text,False,True)))/2;    end;   end;    stepup(ExcludePB);   end;   finally    BatchProcess:= False;   initprogress(ExcludePB, 1);    AssignFile(F,‘randdist_class.txt’);   Rewrite(F);    if testset then begin    Writeln(F, ‘distance’,chr(9), ‘aveclass’);    for i:= 1 to nH0rep do     writeln(F,RandDist[i], chr(9), RandClass[i]);    CloseFile(F);    RandDist.Free;   RandClass.Free;  end  else begin   Writeln(F, ‘distance’);   for i:=1 to nH0rep do    writeln(F, RandDist[i]);   CloseFile(F);  RandDist.Free;   end;  end; end; procedureTExcludeForm.RunFromDiffClButtonClick(Sender: TObject); var GeneList:TStringList;  i, j, nsteps, nelem, 1length: integer;  testset: boolean;begin  GeneList:= TStringList.Create;  GeneList.CommaText:=DiffClusterForm.OutputMemo.Text;  1length:= GeneList.Count;  nelem:=Trunc(DiffClusterForm.NumElemInput.Value);  nsteps:=Trunc(1length/nelem);  ExcludeResult.RowCount:= nsteps+1;  testset:=(FileExists(Class1Box.Text)) and     (FileExists(Class2Box.Text)); ExcludePB.Position:= 0;  ExcludePB.max:= nsteps + 1;  for i:= 1 tonsteps do begin   try   ExcludePB.StepIt;   ExcludePB.Update;  BatchProcess:= True;   with DiffClusterForm.OutputMemo do begin   Clear;    Text:= GeneList[(i−1)*nelem];    for j:= 1 to nelem−1 do    Text:= Text + ‘, ’ + GeneList[(i−1)*nelem + j];   end;  DiffClusterForm.DistButtonClick(self);   if notAssigned(ClassificationForm) then    ClassificationForm:=TClassificationForm.Create(self);   ifDiffClusterForm.AdjustType.ItemIndex <=2 then   ClassificationForm.AdjustOptions.ItemIndex:=   DiffClusterForm.AdjustType.ItemIndex;  ClassificationForm.CrossvalidButtonClick(self);   ExcludeResult.Row:=i;   ExcludeResult.CellAsInt[0,i]:= i;   ExcludeResult.Cells[1,i]:=DiffClusterForm.OutputMemo.Text;   ExcludeResult.Cells[2,i]:=ClassificationForm.PCorrectOutput.   Text+‘%’;  ExcludeResult.Cells[3,i]:= DiffClusterForm.DistOutput.Caption;  ExcludeResult.Cells[4,i]:=   FloatToStrF(MinFreqInDiffcl*100,ffFixed,3,1)+‘%’;  ExcludeResult.Cells[5,i]:=   FloatToStrF(MaxFreqInDiffcl*100,ffFixed,3,1)+‘%’;   if testset thenbegin    ExcludeResult.Cells[6,i]:=   ClassifyTestSets(Class1Box.Text,True,False)+‘%’;   ExcludeResult.Cells[7,i]:=   ClassifyTestSets(Class2Box.Text,False,False)+‘%’;   end;   finally  BatchProcess:= False;   end;  end;  ExcludePB.Position:= 0; DiffClusterForm.OutputMemo.Text:= GeneList.CommaText  GeneList.Free;end; end.

EXAMPLE 2 Analysis on Microarray Expression Data from Colon Cancer CellLines

Two colon cancer cell lines were used in this experiment. HT29 cellsrepresent advanced, highly aggressive colon tumors. They containmutations in both the APC gene and p53 gene, two tumor suppressor genesthat frequently mutate during colon tumorigenesis. HCT116 cells manifestless aggressive colon tumors and harbor functional p53 and APC. They aredefective in DNA repair. The experiment was performed with three RNAsamples (1 μg RNA each). Cy-3-dCTP (green) was used to label HCT116cells while Cy-5-dCTP (red) was used for HT29 cells. Six independentreplicates were obtained each for HT29 and HCT116 cell lines. Inaddition, the data from a separate experiment was used as theindependent test set, which contained eight replicates for each cellline.

The analysis of differential expression of the two cell lines wascarried out similarly as the computer simulation study described supra.The number of permutation was set at 300 in this analysis (in accordancewith Procedure 2 supra), and the size of the subsets is k=5 (inaccordance with Procedure 4 supra). The results from application of thethree stopping rules (the associated probability distance, the test setclassification rate, and the cross validation classification rate) areshown in FIG. 2. The estimates of the corresponding null-distributionsare shown in FIG. 3.

Referring to FIG. 2, the left y axis represents the associatedprobability distance while the right y axis denotes the classificationrate based on the independent test set (hence test set classificationrate—“Class”) and the classification rate based on cross validationusing selected gene set (hence cross validation classificationrate—“CV”). The x axis denotes the number of subsets of genes with apredetermined size of 5. The dotted horizontal lines represent the levelof the 99th percentile of the null-distributions of the correspondingmeasures (i.e., the associated probability distance, the test setclassification rate, and the cross validation classification rate); theywere estimated by generating 300 random permutation samples that mimic“no-difference” data in accordance with Procedure 2 supra. Using the99th percentile of the null-distribution as the cutoff, the crossvalidation rate approach stops at the 57th subset and the distance-basedcriteria stops at 56th subset (referring to the black diamonds on thesolid lines “CV” and “Dist” in FIG. 2). Whereas, the smoothed (viaisotonic regression as discussed supra) test set classification ratedrops below the cutoff much earlier, at the 12th subset (referring tothe black diamond on the solid line “Class” in FIG. 2). However, is whenthe 95th percentile was used, the stopping points for all three measureswere at closer vicinity relative to one other. The extremely highvariability of the test set classification rate may be responsible forsuch discrepancy, since the test set data was generated in a separateand earlier experiment.

Further comparison was carried out between the multivariate randomsearch procedure of this invention and a univariate selection approach.The genes were sorted according to the values of the correspondingmarginal t-statistics and the top 56×5=280 genes were selected, suchthat the number of selected genes was identical to that identified bythe multivariate distance-based cutoff criterion as discussed above. Itwas observed that the resulting two groups of differentially expressedgenes share only 94 genes (33%). The first gene that did not appear inthe group selected using the univariate approach—Hs.2867,Interferon-alpha induced 11.5 KD protein—appeared in the fourth subsetG₄ of genes identified by the multivariate selection approach.Interestingly, another copy of the same gene did appear in both groupsand that corresponding changes in the interferon pathway in the HT29cell line are well known. Therefore, identification of this gene asbeing differentially expressed was an accurate conclusion and, themultivariate search method of this invention was advantageously moresensitive compared to the univariate approach.

It is to be understood that the description, specific examples and data,while indicating exemplary embodiments, are given by way of illustrationand are not intended to limit the present invention. Various changes andmodifications within the present invention will become apparent to theskilled artisan from the discussion, disclosure and data containedherein, and thus are considered part of the invention.

1. A method for identifying a set of genes from a multiplicity of geneswhose expression levels at a first state and a second state are measuredin replicates using one or more nucleotide arrays, thereby generating afirst plurality of independent measurements of the expression levels forsaid first state and a second plurality of independent measurements ofthe expression levels for said second state, which method comprises thefollowing sequential steps: (a) identifying a quality function capableof evaluating the distinctiveness between the first plurality and thesecond plurality; (b) forming a first predetermined number ofpermutations from the first and the second pluralities, dividing saidpermutations into a first permutated plurality and a second permutatedplurality, corresponding in size, to said first and second plurality,respectively, and identifying groups of genes the size of which is asecond predetermined number, wherein the values of the quality functionfor the group of genes in said first permutated and second permutatedpluralities attain the maximum; (c) determining, from said first andsecond permutated pluralities, the top α^(th) percentile of the nulldistribution based on a quantitative characteristic of said groups ofgenes; (d) identifying, based on the first and second pluralities, asubset of genes the size of which is said second predetermined number,wherein the values of the quality function for said subset of genes insaid first and second pluralities attain the maximum; (e) adding to theset of genes, said subset, if the value of said quantitativecharacteristic associated with said subset exceeds said top α^(th)percentile of the null distribution; and (f) removing from the first andsecond pluralities, all measurements on said subset, if the maximumvalue of the quality function associated with said subset exceeds saidtop α^(th) percentile of the null distribution, and repeating steps(d)-(f) until no more measurements are left in the first and secondpluralities or the value of said quantitative characteristic associatedwith the subset does not exceed said top α^(th) percentile of the nulldistribution.
 2. The method of claim 1, wherein said states are selectedfrom the group consisting of biological states, physiological states,pathological states, and prognostic states.
 3. A method for identifyinga set of genes from a multiplicity of genes whose expression levels in afirst tissue and a second tissue are measured in replicates using one ormore nucleotide arrays, thereby generating a first plurality ofindependent measurements of the expression levels for said first tissueand a second plurality of independent measurements of the expressionlevels for said second tissue, which method comprises: (a) identifying aquality function capable of evaluating the distinctiveness between thefirst plurality and the second plurality; (b) forming a firstpredetermined number of permutations from the first and the secondpluralities, dividing said permutations into a first permutatedplurality and a second permutated plurality, corresponding in size tosaid first and second plurality, respectively, and identifying groups ofgenes the size of which is a second predetermined number, wherein thevalues of the quality function for the group of genes in said firstpermutated and second permutated pluralities attain the maximum; (c)determining, from said first and second permutated pluralities, the topα^(th) percentile of the null distribution based on a quantitativecharacteristic of said groups of genes; (d) identifying, based on thefirst and second pluralities, a subset of genes the size of which issaid second predetermined number, wherein the values of the qualityfunction for said subset of genes in said first and second pluralitiesattain the maximum; (e) adding to the set of genes, said subset, if thevalue of said quantitative characteristic associated with said subsetexceeds said top α^(th) percentile of the null distribution; and (f)removing from the first and second pluralities, all measurements on saidsubset, if the maximum value of the quality function associated withsaid subset exceeds said top α^(th) percentile of the null distribution,and repeating steps (d)-(f) until no more measurements are left in thefirst and second pluralities or the value of said quantitativecharacteristic associated with the subset does not exceed said topα^(th) percentile of the null distribution.
 4. The method of claim 3,wherein said tissues are selected from the group consisting of normallung tissues, cancer lung tissues, normal heart tissues, pathologicalheart tissues, normal and abnormal colon tissues, normal and abnormalrenal tissues, normal and abnormal prostate tissues, and normal andabnormal breast tissues.
 5. A method for identifying a set of genes froma multiplicity of genes whose expression levels in a first type of cellsand a second type of cells are measured in replicates using one or morenucleotide arrays, thereby generating a first plurality of independentmeasurements of the expression levels for said first type of cells and asecond plurality of independent measurements of the expression levelsfor said second types of cells, which method comprises: (a) identifyinga quality function capable of evaluating the distinctiveness between thefirst plurality and the second plurality; (b) forming a firstpredetermined number of permutations from the first and the secondpluralities, dividing said permutations into a first permutatedplurality and a second permutated plurality, corresponding in size, tosaid first and second plurality, respectively, and identifying groups ofgenes the size of which is a second predetermined number, wherein thevalues of the quality function for the group of genes in said firstpermutated and second permutated pluralities attain the maximum; (c)determining, from said first and second permutated pluralities, the topα^(th) percentile of the null distribution based on a quantitativecharacteristic of said groups of genes; (d) identifying, based on thefirst and second pluralities, a subset of genes the size of which issaid second predetermined number, wherein the values of the qualityfunction for said subset of genes in said first and second pluralitiesattain the maximum; (e) adding to the set of genes, said subset, if thevalue of said quantitative characteristic associated with said subsetexceeds said top α^(th) percentile of the null distribution; and (f)removing from the first and second pluralities, all measurements on saidsubset, if the maximum value of the quality function associated withsaid subset exceeds said top α^(th) percentile of the null distribution,and repeating steps (d)-(f) until no more measurements are left in thefirst and second pluralities or the value of said quantitativecharacteristic associated with the subset does not exceed said topα^(th) percentile of the null distribution.
 6. The method of claim 5,wherein said types of cells are selected from the group consisting ofnormal lung cells, cancer lung cells, normal heart cells, pathologicalheart cells, normal and abnormal colon cells, normal and abnormal renalcells, normal and abnormal prostate cells, and normal and abnormalbreast cells.
 7. The method of claim 5, wherein said type of cells areselected from the group consisting of cultured cells and cells isolatedfrom an organism.
 8. The method of claim 1, 3, or 5, wherein saidquality function is a probability distance function.
 9. The method ofclaim 8, wherein said probability distance function is selected from thegroup consisting of the Mahalanobis distance and the Bhattacharyadistance.
 10. The method of claim 8, wherein the probability distancefunction is defined as:N(μ,ν)=2∫_(R) _(d) ∫_(R) _(d) L(x,y)dμ(x)dν(y)−∫_(R) _(d) ∫_(R) _(d)L(x,y)dμ(x)dμ(y)−∫_(R) _(d) ∫_(R) _(d) L(x,y)dν(x)dν(y) where μ and νare two probability measures defined on the Euclidean space, and L(x,y)is a strictly negative definite kernel.
 11. The method of claim 10,wherein the negative definite kernel is combined with the Euclideandistance between x and y to form a composite kernel function.
 12. Themethod of claim 1, 3, or 5, wherein the quantitative characteristic isselected from the group consisting of an associated probabilitydistance, a test set classification rate, and a cross validationclassification rate.
 13. The method of claim 1, 3, or 5, wherein theformation of the permutations further comprises: (i) shifting themeasurements in the first and second pluralities such that the marginalmeans thereof share the same true mean; and (ii) randomly permuting theresulting shifted measurements thereby forming a null-distribution ofpermutations.
 14. The method of claim 1, 3, or 5, wherein theidentifying further comprises: (i) calculating the values of the qualityfunction for said subset of genes in said first and second pluralitiesthereby evaluating the distinctiveness of said first and secondpluralities; (ii) substituting a gene in said subset with one outside ofsaid subset, thereby generating a new subset, and repeating step (i),keeping the new subset if the distinctiveness increases and the originalsubset if otherwise; and (iii) repeating steps (i) and (ii) for a fourthpredetermined number of times.
 15. The method of claim 1, 3, or 5,wherein the identifying further comprises: (i) randomly dividing thefirst and the second pluralities into v groups of an approximate equalsize; (ii) removing one of said v groups from said first and secondpluralities and identifying, from the resulting reduced first and secondpluralities, a subset of genes for which the value of said qualityfunction attains the maximum; and (iii) repeating step (ii) for each ofsaid v groups thereby obtaining v subsets of genes.
 16. The method ofclaim 1, 3, or 5, wherein the nucleotide arrays are selected from thegroup consisting of arrays having spotted thereon cDNA sequences andarrays having synthesized thereon oligonucleotides.